perf: avoid O(N^2) exiting-branch checks in CodeFolding #8599
Changqing-JING wants to merge 5 commits into WebAssembly:main from
Conversation
Force-pushed from 1dae3f3 to 66dff99
Force-pushed from daf81f7 to f263f08
// efficient bottom-up traversal.
bool hasExitingBranches(Expression* expr) {
  if (!exitingBranchCachePopulated_) {
    populateExitingBranchCache(getFunction()->body);
Looks like this still scans the entire function. I suggest that we only scan expr itself. That will still avoid re-computing things, but avoid scanning things that we never need to look at.
This does require that the cache store a bool, so we know if we scanned or not, and if we did, if we found branches out or not. But I think that is worth it - usually we will scan very few things.
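The tri-state cache suggested above could look roughly like this. This is a minimal from-scratch sketch, not Binaryen's actual types: `Expr` is a hypothetical opaque node and the scan itself is a caller-supplied function. The key point is that absence from the map means "not scanned yet", so a stored `false` is distinguishable from "unknown".

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical opaque IR node, standing in for Binaryen's Expression.
struct Expr {};

// Tri-state memoization: absent from the map = not scanned yet;
// present = scanned, with the recorded true/false result.
struct ExitCache {
  std::unordered_map<const Expr*, bool> cache;

  template <typename ScanFn>
  bool hasExitingBranches(const Expr* e, ScanFn scan) {
    if (auto it = cache.find(e); it != cache.end()) {
      return it->second;  // already scanned; reuse the stored result
    }
    bool result = scan(e);  // scan only this subtree, not the whole function
    cache.emplace(e, result);
    return result;
  }
};
```

A repeated query on the same node then costs one hash lookup instead of a subtree walk.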
The per-expression cache would still be O(N^2) in the nested block case. AssemblyScript GC emits __visit_members with deeply nested blocks + br_table, where the nesting level equals the number of classes (4000+ in real apps). Each nested block gets queried by optimizeTerminatingTails, and each query walks its overlapping subtree independently, giving O(N + (N-1) + ... + 1) = O(N^2) total work even with the cache.
We also cannot reuse a child's cached bool to compute a parent's result, because knowing "child has exiting branches" does not tell us which names exit -- the parent may define/resolve some of them. To compose results bottom-up, we would need to store the full set of unresolved names per expression. I benchmarked that approach (storing unordered_map<Expression*, unordered_set> and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).
The whole-function scan avoids both issues by computing all results in a single O(N) pass using only integer counters, with no per-node name storage.
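For the record, one way to get such a single-pass, integer-only computation is a DFS-entry-index ("Euler tour") trick: a subtree contains an exiting branch iff the smallest entry index among the branch targets inside it is smaller than the subtree root's own entry index, since in structured WASM control flow every branch targets an enclosing block. The sketch below is a from-scratch illustration on a hypothetical mini-IR (`Node`, `label`, `brTarget` are invented names), not the PR's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical mini-IR: a node with a non-empty `label` defines a block
// label; a node with a non-empty `brTarget` branches to an enclosing label.
struct Node {
  std::string label;
  std::string brTarget;
  std::vector<Node*> children;
};

struct ExitingBranchCache {
  std::unordered_map<Node*, bool> hasExiting;

  // Single O(N) bottom-up pass; per-node state is just integers.
  void populate(Node* root) {
    int clock = 0;
    // label -> stack of DFS entry indices of its (nested) definitions.
    std::unordered_map<std::string, std::vector<int>> defEntry;
    dfs(root, clock, defEntry);
  }

private:
  // Returns the smallest entry index of any branch target in the subtree.
  int dfs(Node* n, int& clock,
          std::unordered_map<std::string, std::vector<int>>& defEntry) {
    int entry = clock++;
    if (!n->label.empty()) {
      defEntry[n->label].push_back(entry);
    }
    int minTarget = INT_MAX;
    if (!n->brTarget.empty()) {
      // Innermost enclosing definition (valid WASM guarantees one exists).
      minTarget = defEntry.at(n->brTarget).back();
    }
    for (auto* c : n->children) {
      minTarget = std::min(minTarget, dfs(c, clock, defEntry));
    }
    if (!n->label.empty()) {
      defEntry[n->label].pop_back();
    }
    // A branch exits n iff its target was entered strictly before n,
    // i.e. the target block is a proper ancestor of n.
    hasExiting[n] = (minTarget < entry);
    return minTarget;
  }
};
```

This keeps the whole-function property described above: one walk, O(1) lookups afterwards, and no per-node name sets.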
We also cannot reuse a child's cached bool to compute a parent's result [..] I benchmarked that approach (storing unordered_map<Expression*, unordered_set> and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).
What is the baseline here? (is it before this PR, or the PR's current state)
Current main costs 9 min.
The PR's current state costs 5 min.
The per-node set allocation version costs 13 min.
I see, thanks. Ok, it might really make sense to scan the whole function then, in a fast way, rather than less code in a slower way.
kripken left a comment:
Looks good but I'll run some local fuzzing before landing.
Unfortunately I see opposite results locally. I tried the two Dart files linked here: https://chromium-review.git.corp.google.com/c/emscripten-releases/+/7769309 I measured like this:
The seconds elapsed regressed, though that might in theory be due to noise. The # of instructions and branches is extremely stable though, and they regress by 3-4%. Perhaps you can take a look at the larger of those two Dart files and see if you get the same issue locally?
Force-pushed from 733935f to b90aee7
Thanks for the feedback!

My dart test result

I tried to run the dart case on my laptop; I have run it for

So the dart test case seems too small to test this PR. It's very hard to measure a 3% regression on it.

My research report

I reworked the approach to avoid the conservative on-demand cache: a pre-fill whole-function scan. Instead of on-demand per-query walks, do a single whole-function walk upfront in
Pre-fill is ~25% faster on the pathological case because it walks the tree exactly once with a single set of transient name sets, while on-demand creates and destroys name sets per query and copies cached name sets from prior walks into new ones. On the other hand, pre-fill pays an upfront cost for every function even when hasExitingBranches is never called, which can regress normal workloads (3-4% on dart). The on-demand version has zero cost when the cache isn't needed, at the expense of being slower on the worst case.

Both the pre-fill version and the on-demand version are now better than main on the test3.wasm case. I've pushed the pre-fill version. Would you prefer the on-demand version instead, or is the pre-fill approach acceptable? Want me to adjust anything?

Attachments:

Run 1: Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):
1,247,751,877 task-clock # 1.915 CPUs utilized ( +- 2.05% )
60 context-switches # 48.086 /sec ( +- 3.11% )
10 cpu-migrations # 8.014 /sec ( +- 7.34% )
35,331 page-faults # 28.316 K/sec ( +- 0.01% )
<not supported> cycles
0.6516 +- 0.0121 seconds time elapsed ( +- 1.86% )

Run 2:

Run 3:

Run 4: taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
\___ Bad event or PMU
Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):
1,175,147,864 task-clock # 1.776 CPUs utilized ( +- 1.59% )
56 context-switches # 47.654 /sec ( +- 3.36% )
8 cpu-migrations # 6.808 /sec ( +- 10.21% )
35,333 page-faults # 30.067 K/sec ( +- 0.01% )
<not supported> cycles
0.66184 +- 0.00673 seconds time elapsed ( +- 1.02% )

Run 5: taskset -c 0-3 perf stat -r 10 build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding
event syntax error: 'topdown-retiring/metric-id=topdown!1retiring/,INT_MISC.CLEARS_COUNT/m..'
\___ Bad event or PMU
Unable to find PMU or event on a PMU of 'topdown-retiring'
warning: no output file specified, not emitting output
Performance counter stats for 'build/bin/wasm-opt dart-flute-complex.unopt.wasm -all --code-folding' (10 runs):
1,327,088,464 task-clock # 1.842 CPUs utilized ( +- 2.24% )
57 context-switches # 42.951 /sec ( +- 3.34% )
9 cpu-migrations # 6.782 /sec ( +- 8.83% )
35,332 page-faults # 26.624 K/sec ( +- 0.00% )
<not supported> cycles
0.7204 +- 0.0122 seconds time elapsed ( +- 1.70% )
It looks like you're on a machine where
Here is what I see without this PR on that Kotlin file (with

And with this PR:

The regression is significantly larger than the noise.
@kripken |
I don't see any code pushed since my comment here which measured a Kotlin regression. Which commit addresses that comment?
@kripken Thank you for the reminder.
It's strange; in my VS Code I confirmed the commit was pushed, and I can also see it in the "Files Changed" tab.
I also noticed that the GitHub web GUI didn't show this activity; maybe there is some GitHub bug here.
To avoid confusion, I just pushed this commit again with another id. Now you can see the commit 94ae0be.
Force-pushed from 733935f to 94ae0be
Co-authored-by: Copilot <copilot@github.com>
// transiently (moved from children, erased after merge). Only the root's
// name set is persisted. Already-cached subtrees are skipped via scan(),
// and their cached names are merged in precisely.
bool populateExitingBranchCache(Expression* root) {
The returning of bool here is a little odd and non-obvious, I think (it returns whether the root has exiting branches, not whether we populated the cache or anything like that). How about removing the result, and, from hasExitingBranches(), populating if necessary and then reading the cache?
Another option might be for this to return a const std::unordered_set<Name>& for the root. That would be unambiguous, together with a comment that explains it is an optimization to avoid another lookup after?
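That alternative could look roughly like the following. This is a hedged sketch, not the PR's code: `Expr` is a hypothetical opaque node, `BranchNameCache` is an invented name, and the actual bottom-up walk is elided. The point is the signature: returning the root's cached set by const reference documents what the function computes and saves the caller a second map lookup.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical opaque IR node, standing in for Binaryen's Expression.
struct Expr {};

struct BranchNameCache {
  std::unordered_map<const Expr*, std::unordered_set<std::string>> names;

  // Returns the set of branch names exiting `root` by const reference,
  // so the caller avoids a second map lookup (empty set = no exits).
  // The reference stays valid as long as the cache entry is not erased.
  const std::unordered_set<std::string>& populate(const Expr* root) {
    auto& set = names[root];  // creates-or-finds the entry in one lookup
    // ... a bottom-up walk would fill `set` here ...
    return set;
  }

  bool hasExitingBranches(const Expr* root) {
    return !populate(root).empty();
  }
};
```

The caller-facing query then reads naturally, and the "does this populate or does this answer?" ambiguity of a plain `bool` return goes away.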
Follow-up PR of #8586 to optimize CodeFolding.
optimizeTerminatingTails calls EffectAnalyzer per tail item, each walking the full subtree. On deeply nested blocks this is O(N^2). Replace the per-item walks with a single O(N) bottom-up PostWalker (populateExitingBranchCache) that pre-computes exiting-branch results for every node, making subsequent lookups O(1).

Example: AssemblyScript GC compiles __visit_members as a br_table dispatch over all types, producing ~N nested blocks with ~N tails. The old code walks each tail's subtree separately -- O(N^2) total node visits. With this change, one bottom-up walk covers all nodes, then each tail lookup is O(1).

Benchmark data
The test module is from issue #7319
#7319 (comment)
In main head:

time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling -o /dev/null ./test3.wasm
real 9m16.111s
user 35m33.985s
sys 0m51.000s

In the PR: